Semantic embedding based online cross-modal hashing method

Hashing has been extensively utilized in cross-modal retrieval due to its high efficiency in handling large-scale, high-dimensional data. However, most existing cross-modal hashing methods operate as offline learning models, which learn hash codes in a batch-based manner and prove to be inefficient for streaming data. Recently, several online cross-modal hashing methods have been proposed to address the streaming data scenario. Nevertheless, these methods fail to fully leverage the semantic information and accurately optimize hashing in a discrete fashion. As a result, both the accuracy and efficiency of online cross-modal hashing methods are not ideal. To address these issues, this paper introduces the Semantic Embedding-based Online Cross-modal Hashing (SEOCH) method, which integrates semantic information exploitation and online learning into a unified framework. To exploit the semantic information, we map the semantic labels to a latent semantic space and construct a semantic similarity matrix to preserve the similarity between new data and existing data in the Hamming space. Moreover, we employ a discrete optimization strategy to enhance the efficiency of cross-modal retrieval for online hashing. Through extensive experiments on two publicly available multi-label datasets, we demonstrate the superiority of the SEOCH method.


Meijia Zhang 1,3, Junzheng Li 2 & Xiyuan Zheng 1*
Recently, with the exponential growth of Internet usage, there has been a surge in information data. Traditional single-modal retrieval methods are no longer sufficient to meet the increasing retrieval needs of individuals. Cross-modal retrieval, as a more effective and in-demand search method, has garnered significant research attention. Commonly used cross-modal retrieval methods [1][2][3][4] employ real-valued vectors to represent multimodal data. However, these methods require extensive computation and suffer from low efficiency.
The aforementioned methods all employ an offline learning model for batch-based training, which may fail to adapt to changing data and consequently reduces retrieval efficiency when faced with large volumes of streaming data. To address these limitations, several online hashing methods 25,26 have been proposed. Similar to offline hashing methods, online hashing methods can be categorized as unsupervised or supervised. Unsupervised online hashing methods analyze the relationships between samples, for example through dimensionality reduction or self-organizing mapping networks. Conversely, supervised online hashing methods often leverage label information to improve retrieval accuracy and mitigate the semantic gap problem.
Although numerous online cross-modal hashing methods have been proposed, existing approaches fail to fully exploit semantic information and accurately optimize hashing in a discrete manner.
To overcome these issues, we propose the Semantic Embedding-based Online Cross-modal Hashing (SEOCH) method, which integrates semantic information exploitation and online learning into a unified framework. To exploit semantic information, we map semantic labels to a latent semantic space and construct a semantic similarity matrix to preserve the similarity between new and existing data in the Hamming space. Moreover, we employ a discrete optimization strategy for online hashing. The main contributions of SEOCH are summarized as follows:
• We construct a semantic similarity matrix to preserve the similarity between new and existing data in the Hamming space, thus mitigating the information loss that occurs when hash codes are learned solely from new data.
• We adopt a discrete optimization strategy for online hashing, which reduces the quantization errors caused by relaxation-based optimization methods.
The remainder of this paper is organized as follows. We first provide an overview of related work on cross-modal hashing methods. Then, our proposed method and its training process are presented. Next, experimental results and the corresponding analysis are reported. Finally, we summarize our work.

Related work
Numerous cross-modal hashing methods have emerged recently. Based on whether semantic label information is used during training, these methods can be categorized into unsupervised and supervised approaches.
On the other hand, supervised hashing methods leverage semantic label information when learning hash codes. Examples include Semantic Correlation Maximization (SCM) 28, Semantics-Preserving Hashing (SePH) 29, Discriminant Cross-modal Hashing (DCH) 30, Subspace Relation Learning for Cross-modal Hashing (SRLCH) 31, and Semantic Topic Multimodal Hashing (STMH) 32. To take full advantage of heterogeneous correlation, many deep cross-modal retrieval methods have been proposed in recent years, such as those in [33][34][35]. For instance, the deep discrete cross-modal hashing with multiple supervision method 34 designs a semantic network to fully exploit the semantic information implied by the labels, attending not only to instance-pairwise and class-wise similarities but also to the instance-label level.
The aforementioned methods are all offline cross-modal retrieval models. However, in practical cross-modal retrieval applications, the input typically arrives in a streaming fashion. Consequently, several online methods have been proposed for this scenario. In the online setting, as new data continuously arrives, online methods use only the newly arrived data to update the current model. This significantly reduces the computational complexity of the learning algorithm and its storage requirements. Notable examples include Online Latent Semantic Hashing (OLSH) 25 and Online Collective Matrix Factorization Hashing (OCMFH) 26, which have garnered increasing attention. Nevertheless, these methods fail to fully exploit semantic information and to accurately optimize hashing in a discrete manner.

The proposed method
Notation
In this paper, we consider a scenario where the numbers of image and text samples are equal. Let $X = \{x_i\}_{i=1}^{n} \in \mathbb{R}^{d_x \times n}$ represent the image samples and $Y = \{y_i\}_{i=1}^{n} \in \mathbb{R}^{d_y \times n}$ represent the text samples, where $d_x$ and $d_y$ denote the dimensions of the image and text modalities, respectively, and $n$ is the number of samples. $L \in \{0, 1\}^{c \times n}$ is the label matrix, where $c$ is the number of classes: if $\{x_i, y_i\}$ belongs to the $j$-th class, $l_{ji} = 1$; otherwise $l_{ji} = 0$. $B \in \{0, 1\}^{k \times n}$ is the hash code matrix, where $k$ represents the number of bits in the hash codes.
Suppose the training data is received in a streaming manner. At the $t$-th round, a new data chunk arrives (quantities belonging to the new chunk are marked with an overhead arrow, e.g., $\overrightarrow{B}^{(t)}$), where $n_t$ denotes the number of new samples at the $t$-th round. Correspondingly, $N_{t-1} = \sum_{i=1}^{t-1} n_i$ is the number of existing samples before round $t$. The heterogeneous samples $x_i$ and $y_j$ are associated through a similarity matrix $S$ with elements $s_{ij}$, where $s_{ij} = 1$ means that $x_i$ and $y_j$ share at least one common class label, and $s_{ij} = 0$ means that they share no common class label.
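As an illustration of the similarity definition above, the following minimal sketch builds a 0/1 similarity matrix between a new chunk and the existing data from their label matrices. The function name `cross_chunk_similarity`, the toy label matrices, and the (new × existing) orientation of the result are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def cross_chunk_similarity(labels_new: np.ndarray, labels_old: np.ndarray) -> np.ndarray:
    """Return S with S[i, j] = 1 if new sample i and existing sample j share a label.

    labels_new: (c, n_t) 0/1 label matrix of the new chunk (one column per sample).
    labels_old: (c, N_prev) 0/1 label matrix of the existing data.
    """
    # Entry (i, j) counts the class labels that the pair has in common.
    shared = labels_new.T @ labels_old
    return (shared > 0).astype(np.int8)


# Toy example (hypothetical data): c = 3 classes, 2 new samples, 3 existing samples.
L_new = np.array([[1, 0],
                  [0, 1],
                  [0, 0]])
L_old = np.array([[1, 0, 0],
                  [0, 0, 1],
                  [1, 1, 0]])
S = cross_chunk_similarity(L_new, L_old)
print(S)
```

Because the product only involves the labels of the new chunk and of the existing data, the matrix has $n_t \times N_{t-1}$ entries per round rather than one entry for every pair in the full dataset.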

Hash-code learning
To facilitate online cross-modal hashing, the overall objective function (i.e., the loss function) is given in Eq. (1), where $S$ is the similarity matrix at round $t$, $B^{(t)} \in \{0, 1\}^{k \times N_{t-1}}$ denotes the hash codes of the existing data, and $\overrightarrow{B}^{(t)} \in \{0, 1\}^{k \times n_t}$ denotes the hash codes of the new data. $Q \in \mathbb{R}^{g \times k}$, $P \in \mathbb{R}^{g \times c}$, $U \in \mathbb{R}^{k \times d_x}$ and $V \in \mathbb{R}^{k \times d_y}$ are four mapping matrices, and $g$ is the dimension of the latent semantic concept space. The five hyperparameters are those indexed 1, 2 and 3, together with $\alpha$ and $\beta$.
The semantic similarity matrix preserves the similarity between the new data and the existing data in the Hamming space, which mitigates the information loss caused by learning hash codes only from new data.

Training
The Semantic Embedding-based Online Cross-modal Hashing (SEOCH) algorithm aims to optimize five variables. To address the objective in Eq. (1), an alternating learning strategy is employed, updating one variable at a time while keeping the others fixed. The entire training process is outlined below.

Update $\overrightarrow{B}^{(t)}$
By fixing all variables except $\overrightarrow{B}^{(t)}$, Eq. (1) can be reformulated as Eq. (2). Differentiating Eq. (2) with respect to $\overrightarrow{B}^{(t)}$ and setting the derivative to zero yields Eq. (3), where $I_1 \in \mathbb{R}^{k \times k}$ denotes an identity matrix. To compute $B^{(t)}$ in Eq. (3), we follow the steps below.
By fixing all variables except $B^{(t)}$, Eq. (1) can be reformulated as Eq. (4). Differentiating Eq. (4) with respect to $B^{(t)}$ and setting the derivative to zero yields the update for $B^{(t)}$.
Differentiating Eq. (1) with respect to $Q^{(t)}$ gives Eq. (6); setting Eq. (6) to zero yields the update for $Q^{(t)}$.
Differentiating Eq. (1) with respect to $P^{(t)}$ gives Eq. (10); setting Eq. (10) to zero yields the update for $P^{(t)}$.
Differentiating Eq. (1) with respect to $U^{(t)}$ gives Eq. (14); setting Eq. (14) to zero yields the update for $U^{(t)}$.
Differentiating Eq. (1) with respect to $V^{(t)}$ gives Eq. (18); setting Eq. (18) to zero yields the update for $V^{(t)}$, where $I_4 \in \mathbb{R}^{d_y \times d_y}$ is an identity matrix.

Out of sample
For a query that is not in the training set, we can generate the hash code of a query point $x_q$ or $y_q$. The complete learning algorithm of our proposed SEOCH is presented in Algorithm 1.
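A common way to hash an unseen query in linear hashing models is to project its features with the learned modality-specific matrix and threshold the result. The sketch below assumes that rule (projection by $U$ or $V$ followed by thresholding at zero to obtain $\{0, 1\}$ codes); this is an assumption for illustration and not necessarily the exact out-of-sample formula of SEOCH. The function name `hash_query` and the random stand-in matrices are hypothetical.

```python
import numpy as np


def hash_query(features: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Project a query vector and threshold at zero to get a {0, 1} code of length k."""
    return (projection @ features > 0).astype(np.uint8)


rng = np.random.default_rng(0)
k, d_x, d_y = 8, 512, 1386          # code length and feature dimensions (MIRFLICKR-25K sizes)
U = rng.standard_normal((k, d_x))   # stand-in for the learned image mapping (assumption)
V = rng.standard_normal((k, d_y))   # stand-in for the learned text mapping (assumption)

x_q = rng.standard_normal(d_x)      # hypothetical image query features
y_q = rng.standard_normal(d_y)      # hypothetical text query features

code_img = hash_query(x_q, U)
code_txt = hash_query(y_q, V)

# Retrieval then compares codes by Hamming distance (number of differing bits).
hamming = int(np.count_nonzero(code_img != code_txt))
print(code_img, code_txt, hamming)
```

With trained matrices in place of the random stand-ins, an image query would be ranked against the text codes in the database by this Hamming distance, and vice versa.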

Algorithm 1. The complete learning algorithm of our proposed SEOCH.

Experiments
Datasets
In order to thoroughly assess the effectiveness of our approach, we conduct experiments on two publicly available multi-label datasets, namely the MIRFLICKR-25K dataset and the NUS-WIDE dataset. Detailed descriptions of these datasets are provided below.
The MIRFLICKR-25K dataset comprises 25,000 images with a total of 24 labels. Each image in this dataset is associated with one or more labels and connected to several textual tags. From this dataset, we randomly select 20,015 image-text pairs that possess at least 20 textual tags. Among these pairs, 2,000 are chosen as queries and the remaining pairs form the training set. The image and text features used are 512-dimensional Scale-Invariant Feature Transform (SIFT) features and 1386-dimensional Bag of Words (BoW) features, respectively. To facilitate online cross-modal hashing, the training set is divided into 9 data chunks, with the first 8 chunks containing 2,000 instances each and the last chunk containing 2,015 instances.
The NUS-WIDE dataset consists of approximately 270,000 images annotated with a total of 81 labels. For our experiments, we select 186,577 image-text pairs that are associated with at least one of the 10 most frequent concepts. Within the NUS-WIDE dataset, we randomly choose 1,867 pairs as queries, while the remaining pairs serve as the database. The image and text features in the database are represented by 500-dimensional Bag-of-Visual Words (BoVW) features and 1000-dimensional BoW features, respectively. Similar to the previous dataset, the training set is divided into 18 data chunks to facilitate online cross-modal hashing, with the first 17 chunks containing 10,000 instances each and the last chunk containing 14,710 instances.
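The chunk sizes quoted for both datasets follow from a simple rule: cut the training set into fixed-size chunks and let the last chunk absorb the remainder. The sketch below checks this arithmetic; the helper name `chunk_sizes` is hypothetical, and the rule itself is inferred from the reported numbers rather than stated in the paper.

```python
from typing import List


def chunk_sizes(n_train: int, chunk_size: int) -> List[int]:
    """Split n_train samples into chunks of chunk_size; the last chunk absorbs the remainder."""
    n_chunks = max(1, n_train // chunk_size)
    sizes = [chunk_size] * n_chunks
    sizes[-1] += n_train - chunk_size * n_chunks
    return sizes


# Training-set size = selected pairs minus query pairs.
mirflickr = chunk_sizes(20_015 - 2_000, 2_000)
nuswide = chunk_sizes(186_577 - 1_867, 10_000)

print(len(mirflickr), mirflickr[-1])
print(len(nuswide), nuswide[-1])
```

For MIRFLICKR-25K this gives 9 chunks with a final chunk of 2,015 instances, and for NUS-WIDE 18 chunks with a final chunk of 14,710 instances, matching the figures above.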

Baselines and evaluated metrics
The proposed method is evaluated against six state-of-the-art cross-modal hashing methods, which can be categorized as follows: (1) offline methods: SCM-seq 28, DCH 30, SRLCH 31, JIMFH 36; (2) online methods: OLSH 25, OCMFH 26. The source codes of these baselines are publicly available online, and the parameters are set based on the recommendations provided in the corresponding papers. In JIMFH, the mAP value is calculated with the number of query data set to 100. To ensure a fair comparison, we set the number of query data to 2,000 and 1,867 for the MIRFLICKR-25K and NUS-WIDE datasets, respectively.
Consistent with previous studies, we employ mean Average Precision (mAP) and Precision-Recall curves to evaluate the retrieval accuracy for two retrieval tasks: Image Retrieval Text (I2T) and Text Retrieval Image (T2I).
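For readers unfamiliar with the metric, the following minimal sketch computes mAP under Hamming ranking. It assumes the usual convention in multi-label hashing evaluation that a database item is relevant to a query when the two share at least one label; codes and labels are stored row-wise here (one sample per row), and ties in Hamming distance are broken by database order. These conventions, the function name, and the toy data are assumptions for illustration.

```python
import numpy as np


def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """mAP of Hamming ranking.

    Codes are (n, k) 0/1 arrays; labels are (n, c) 0/1 arrays.
    """
    ap_values = []
    for code, label in zip(query_codes, query_labels):
        relevant = (db_labels @ label) > 0
        if not relevant.any():
            continue  # queries with no relevant database item are skipped
        hamming = np.count_nonzero(db_codes != code, axis=1)
        order = np.argsort(hamming, kind="stable")  # nearest codes first
        ranks = np.flatnonzero(relevant[order]) + 1  # 1-based ranks of the relevant items
        precision_at_hits = np.arange(1, len(ranks) + 1) / ranks
        ap_values.append(precision_at_hits.mean())
    return float(np.mean(ap_values)) if ap_values else 0.0


# Toy I2T example (hypothetical): one image query against three text codes.
db_codes = np.array([[0, 0], [0, 1], [1, 1]])
db_labels = np.array([[1, 0], [0, 1], [1, 0]])
query_codes = np.array([[0, 0]])
query_labels = np.array([[1, 0]])

score = mean_average_precision(query_codes, db_codes, query_labels, db_labels)
print(round(score, 4))
```

In the toy example the relevant items sit at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2. For I2T the queries are image codes and the database holds text codes; T2I swaps the two roles.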

Experimental results and analysis
The mean Average Precision (mAP) scores of SEOCH and the comparison methods in the final round on the MIRFLICKR-25K and NUS-WIDE datasets are presented in Tables 1 and 2, respectively. Moreover, Figs. 1 and 2 display the mAP scores of the different methods at each round on the two datasets, using 8-bit and 32-bit hash codes.
From the above results, it can be observed that: (1) The proposed SEOCH significantly outperforms all the offline baselines in almost all tasks, demonstrating its efficiency in streaming data scenarios. (2) SEOCH outperforms the online baselines in most of the retrieval tasks, indicating the superiority of the semantic embedding-based learning method. (3) The discrete methods, namely JIMFH, significantly outperform the relaxation-based methods, i.e., SCM-seq, validating that discrete hashing methods are more effective for semantic similarity preservation. (4) As the code length increases, the performance of all methods improves. (5) Compared with the 8-bit and 16-bit hash codes, the performance improvement of SEOCH is more significant when using longer codes (e.g., 32 bits or above), indicating its ability to exploit the semantic structure of the data in a high-dimensional Hamming space.

Further analysis
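Ablation study
Three variants of SEOCH are devised to assess the contribution of each component of the proposed method, as presented in Table 3. SEOCH-I sets hyperparameter 1 to 0. SEOCH-II sets hyperparameters 2 and 3 to 0. SEOCH-III eliminates the similarity matrix. From Table 3, it can be observed that for 8 bits, SEOCH-III achieves the lowest result; for 16, 32, and 64 bits, SEOCH-II exhibits the lowest result; for 128 bits, SEOCH-I demonstrates the lowest result. Hence, it can be concluded that each component in our proposed SEOCH plays a significant role in the retrieval outcomes.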

Time cost analysis
To validate the efficiency of the proposed SEOCH, we conducted additional experiments on the MIRFLICKR-25K dataset to compare the training times of the baseline methods and SEOCH. In these experiments, we set the hash code length to 8 bits and 32 bits, respectively. The training times of the two online methods under the same configurations are presented in Table 4.
From Table 4, it is evident that the proposed SEOCH not only achieves superior retrieval performance but also exhibits the shortest training time. Hence, SEOCH improves the efficiency of online hash learning.

Conclusion
This paper focuses on harnessing the semantic correlation between different modalities and enhancing the efficiency of cross-modal retrieval in online scenarios. We propose an approach called Semantic Embedding-based Online Cross-modal Hashing (SEOCH), which integrates the exploitation of semantic information and online learning into a unified framework. To leverage semantic information, we map semantic labels to a latent semantic space and construct a semantic similarity matrix to preserve the similarity between new and existing data in the Hamming space. Moreover, we employ a discrete optimization strategy for online hashing. Extensive experiments on two publicly available multi-label datasets validate the superiority of SEOCH.

Figure 2. The mAP scores at each round for two retrieval tasks on the NUS-WIDE dataset.

Table 1. The mAP scores of SEOCH and the comparison methods in the final round on the MIRFLICKR-25K dataset (numbers in boldface indicate the highest scores).

Table 2. The mAP scores of SEOCH and the comparison methods in the final round on the NUS-WIDE dataset (numbers in boldface indicate the highest scores).

Table 3. Ablation study on the MIRFLICKR-25K dataset (numbers in bold indicate the best performance).